Posters - Schedules

Posters at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19 and no later than July 23. All registered conference participants will have access to the posters and presentations through the conference platform until October 31, 2021. Q&A is available through a chat function, and poster presenters can schedule small-group discussions with up to 15 delegates during the conference.

Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters

Ideally, authors should be available for interactive chat during the times noted below:

Session A: Sunday, July 25 between 15:20 - 16:20 UTC
3DSIG, Bio-Ontologies, BioVis, HitSeq, Special Session 01, TransMed

Session B: Monday, July 26 between 15:20 - 16:20 UTC
3DSIG, Bio-Ontologies, BioVis, CompMS, Education, EvolCompGen, HitSeq, NetBio, RegSys, TransMed

Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
3DSIG, BioVis, CompMS, Education, EvolCompGen, HitSeq, NetBio, RegSys, TransMed

Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
CAMDA, Education, EvolCompGen, Function, iRNA, MICROBIOME, MLCSB, RegSys, Text Mining

Session E: Thursday, July 29 between 15:20 - 16:20 UTC
BIOINFO-CORE, BOSC, CAMDA, COVID-19, EvolCompGen, Function, iRNA, MICROBIOME, MLCSB, RegSys, Special Session 05, SysMod, Text Mining, VarI, General Comp Bio

A graph representation of the individual exome variation with evidence from biomedical text corpora.

COSI: Text Mining

Ioannis Giannoulakis, Foundation for Research and Technology, Greece
Alexandros Kanterakis, Foundation for Research and Technology, Greece
George Potamias, Foundation for Research and Technology, Greece
Ioannis Iliopoulos, University of Crete School of Medicine, Greece

Short Abstract: One of the most crucial steps in clinical genetics pipelines is variant annotation and prioritization, in which we attempt to enrich an individual’s variation with information from public genomic databases. Despite the plethora of available methods for information extraction from biomedical text, they rarely take part in the annotation/prioritization step of typical NGS pipelines. This is because existing methods are not suited to mass querying of the complete genome variation of an individual. Here we present VCF2PHEN, an open tool that, in less than 10 hours, builds a graph from the BioC corpus, which comprises all open and extensively pre-annotated PubMed articles. In this graph, nodes represent Articles (n=19M), Chemicals (n=350K), Diseases (n=11K), Genes (n=37K), Mutations (n=422K) and Transcripts (n=127K), interconnected through 106 million edges. All mutations have been homogenized and validated through VariantValidator. The graph can be queried and explored with the Cypher language, served and visualized through the Neo4j graph database engine. Through this engine we can query the entirety of variants (~100K) identified in NGS experiments in a practical timescale. The result of this query is a personalized graph containing all existing bibliographic evidence linking the individual’s genetic profile with known diseases and chemical/drug interactions.
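
The kind of query described above can be illustrated with a minimal Python sketch. It is a toy in-memory stand-in, not VCF2PHEN, Neo4j or Cypher: the node labels (Mutation, Disease, Chemical) follow the abstract, but the edge list, all identifiers and the function names `build_index` and `personalized_evidence` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy edges: (article, entity mentioned in that article).
# All identifiers are made up; the real graph schema is not given in the abstract.
mentions = [
    ("article:1", ("Mutation", "rsA")),
    ("article:1", ("Disease", "disease X")),
    ("article:2", ("Mutation", "rsB")),
    ("article:2", ("Chemical", "drug Y")),
    ("article:3", ("Mutation", "rsC")),
    ("article:3", ("Disease", "disease Z")),
]


def build_index(mentions):
    """Index the edge list in both directions for fast lookups."""
    by_article = defaultdict(set)
    by_entity = defaultdict(set)
    for article, entity in mentions:
        by_article[article].add(entity)
        by_entity[entity].add(article)
    return by_article, by_entity


def personalized_evidence(variants, by_article, by_entity):
    """For each variant, collect diseases/chemicals co-mentioned in an article."""
    evidence = defaultdict(set)
    for variant in variants:
        for article in by_entity.get(("Mutation", variant), ()):
            for label, name in by_article[article]:
                if label in ("Disease", "Chemical"):
                    evidence[variant].add((label, name, article))
    return dict(evidence)


by_article, by_entity = build_index(mentions)
# "rsD" stands for a variant with no literature evidence in the toy graph.
result = personalized_evidence(["rsA", "rsB", "rsD"], by_article, by_entity)
print(result)
```

Variants without any co-mentioned disease or chemical simply do not appear in the result, which mirrors the idea of a personalized subgraph containing only the evidence that exists.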

A machine learning framework for discovering and enriching metagenomics metadata from open access research articles

COSI: Text Mining

Maaly Nassar, EMBL-EBI, United Kingdom
Robert D Finn, EMBL-EBI, United Kingdom
Johanna McEntyre, EMBL-EBI, United Kingdom

Short Abstract: Metagenomics is a culture-independent approach for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally and/or taxonomically), either from a longitudinal study or between independent studies, can provide clues into how the microbiota have adapted to a particular environment. However, to understand the impact of environmental factors on the microbiome, it is important to also account for experimental confounding factors. Metagenomics databases, such as MGnify, provide analytical services that enable consistent functional and taxonomic annotation and so mitigate bioinformatic confounding factors. However, a recurring challenge is that key metadata about the sample (e.g. location, pH) and the molecular methods used to extract and sequence the genetic material are often missing from the sequence records. This missing metadata may nevertheless be found in publications describing the research. When identified, the additional metadata can lead to a substantial increase in data reuse and greater confidence in the interpretation of observed biological trends. Here, we describe a machine learning framework that automatically extracts relevant metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework includes 3 processes: (1) literature classification and triage, (2) named entity recognition (NER) and (3) database enrichment.

Analyzing the information content of text-based files in supplementary materials of biomedical literature

COSI: Text Mining

Nona Naderi, HES-SO Genève/HEG/University of Applied Sciences Geneva, Swiss Institute of Bioinformatics (SIB), Switzerland
Anaïs Mottaz, HES-SO Genève/HEG/University of Applied Sciences Geneva, Swiss Institute of Bioinformatics (SIB), Switzerland
Douglas Teodoro, HES-SO Genève/HEG/University of Applied Sciences Geneva, Swiss Institute of Bioinformatics (SIB), Switzerland
Patrick Ruch, HES-SO Genève/HEG/University of Applied Sciences Geneva, Swiss Institute of Bioinformatics (SIB), Switzerland

Short Abstract: We present an analysis of supplementary materials of PubMed Central (PMC) articles and show their importance in indexing and searching biomedical literature, in particular for the emerging genomic medicine field. On a subset of articles from PubMed Central, we use text mining methods to extract MeSH terms from abstracts and from text-based supplementary materials, such as spreadsheets and doc(x) files. We find that the recall of MeSH annotations increases by about 5.9 percentage points (+20% in relative terms) when supplementary materials are considered in addition to abstracts. We further compare the supplementary material annotations with annotations found in the article's full text and find that the recall of MeSH terms increases by 1.5 percentage points (+3% in relative terms). Additionally, we analyze genetic variant mentions in abstracts and full texts and compare them with mentions found in text-based files in the supplementary materials. We find that the majority of variants (about 99%) are found in text-based files of supplementary materials. Our study also highlights which types of information appear in spreadsheets that are often missing in abstracts. In conclusion, we suggest that supplementary data should receive more attention from the information retrieval community, in particular in life and health sciences.
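
The distinction between a gain in percentage points and a relative gain can be made concrete with a short Python sketch. The MeSH-like identifiers and the helper names `recall` and `recall_gain` are hypothetical, and the toy numbers are not the study's data.

```python
from typing import Set, Tuple


def recall(found: Set[str], gold: Set[str]) -> float:
    """Fraction of gold-standard terms that were recovered."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(found & gold) / len(gold)


def recall_gain(baseline: float, enriched: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = (enriched - baseline) * 100
    relative_pct = (enriched - baseline) / baseline * 100
    return absolute_pp, relative_pct


# Hypothetical toy annotations (identifiers are made up).
gold = {"T01", "T02", "T03", "T04", "T05", "T06", "T07", "T08", "T09", "T10"}
from_abstract = {"T01", "T02", "T03"}
from_supplement = {"T03", "T04", "T05"}

baseline = recall(from_abstract, gold)                    # abstracts only
enriched = recall(from_abstract | from_supplement, gold)  # abstracts + supplements
absolute_pp, relative_pct = recall_gain(baseline, enriched)
print(f"{baseline:.2f} -> {enriched:.2f}: "
      f"+{absolute_pp:.1f} points, +{relative_pct:.1f}% relative")
```

Read this way, a gain of 5.9 points that equals +20% in relative terms would correspond to a baseline recall near 29.5%, although the abstract does not state the baseline explicitly.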

Annotation of single-cell transcriptomes using NLP-derived cell-type marker genes

COSI: Text Mining

Anna Yannakopoulos, Michigan State University, United States
Stephanie Hickey, Michigan State University, United States
Arjun Krishnan, Michigan State University, United States

Short Abstract: Single-cell RNA-sequencing allows us to measure gene expression levels in thousands of individual cells from a heterogeneous tissue sample simultaneously, but assigning a cell-type label to each single-cell transcriptome after sequencing is challenging. Researchers often use known cell-type marker genes to make these assignments, but curating lists of marker genes from the scientific literature is time consuming, and inconsistent marker gene lists from different research groups hinder reproducibility. We hypothesize that natural language processing (NLP) can be used to identify useful markers for thousands of cell types in an unbiased manner. To test this hypothesis, we leveraged millions of PubMed abstracts to generate numerical vector representations of ~15k ENSEMBL genes and each of the thousands of cell types described in the Cell Ontology. We then used supervised and unsupervised methods to predict the relationships between genes and cell-types, giving us a score for each gene/cell-type pair. To ensure the scores were cell-type-specific, they were normalized among groups of related cell types. We found the top ranked normalized NLP markers outperformed hand-curated markers when identifying PBMC cell types, providing a proof of principle that NLP approaches can create unbiased lists of cell-type-specific marker genes useful for annotating single-cell RNA-seq data.
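
As a rough illustration of normalizing scores among groups of related cell types, the Python sketch below z-scores each gene's scores within a group. The abstract does not say which normalization was used, so the z-score is an assumption, and the gene names, cell-type grouping, scores and the function `normalize_within_group` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_within_group(scores, groups):
    """Z-score each gene's scores across the cell types of one group.

    scores: {(gene, cell_type): raw score}
    groups: {cell_type: group name}
    """
    # Collect the raw scores of every (gene, group) pair.
    pooled = defaultdict(list)
    for (gene, cell_type), score in scores.items():
        pooled[(gene, groups[cell_type])].append(score)

    normalized = {}
    for (gene, cell_type), score in scores.items():
        values = pooled[(gene, groups[cell_type])]
        spread = pstdev(values)
        # A gene that scores identically across the group carries no specificity.
        normalized[(gene, cell_type)] = (
            0.0 if spread == 0 else (score - mean(values)) / spread
        )
    return normalized


# Hypothetical raw gene/cell-type scores within one group of related cell types.
groups = {"T cell": "lymphocyte", "B cell": "lymphocyte", "NK cell": "lymphocyte"}
scores = {
    ("GENE1", "T cell"): 0.9,
    ("GENE1", "B cell"): 0.5,
    ("GENE1", "NK cell"): 0.1,
    ("GENE2", "T cell"): 0.4,
    ("GENE2", "B cell"): 0.4,
    ("GENE2", "NK cell"): 0.4,
}
normalized = normalize_within_group(scores, groups)
print(normalized)
```

In this toy case GENE1 ends up with a positive normalized score only for the cell type where it scores above its group average, while GENE2, which scores the same everywhere, is reduced to zero.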

Critical Assessment of Transformer-based Models for German Clinical Data

COSI: Text Mining

Manuel Lentzen, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Germany
Sumit Madan, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Germany
Vanessa Lage-Rupprecht, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Germany
Holger Fröhlich, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Germany

Short Abstract: Medical research can benefit from information in electronic health records, but, as they often exist as unstructured free text, processing with machine learning tools is challenging. Transformer-based models like BERT represent a promising approach to tackle this issue, as they achieve state-of-the-art results in many domains; however, applications in the biomedical context focus mainly on the English language. While German language models such as GermanBERT and gottBERT are available, domain-specific models for biomedical data are yet to be developed.
In this study, we critically assessed the suitability of existing and new models for the biomedical domain. We used five German language models, pre-trained a new model on a newly-assembled biomedical corpus, and compared them with each other. For the evaluation, we annotated a new dataset of clinical documents and used it alongside two other corpora (GGPONC and JSynCC) for named-entity recognition and sequence classification.
Despite the small corpus available for pre-training, the domain-specific model provided better prediction performances than an existing rule-based system. However, unspecific German language models were not outperformed by domain-specific ones, suggesting such models as a first opportunity for the German-speaking region. Higher performances of domain-specific models might be achievable if larger corpora for pre-training were available.

OntoSemantics: A prototype for a Searchable Semantic Network over Biomedical Texts

COSI: Text Mining

Joseph Bonello, University of Malta, Malta
Matthew Drago, University of Malta, Malta
Ernest Cachia, University of Malta, Malta

Short Abstract: The curation of biomedical texts is an essential task in the biomedical field. Automating the curation process can go a long way in improving the accuracy of annotations that are deposited in widely used resources.

We are proposing a new method for annotating abstracts and biomedical texts, OntoSearch. OntoSearch exploits the controlled vocabularies of various ontologies to identify terms and events in literature that are used to create a searchable Semantic Network.

OntoSearch also uses a number of Natural Language Processing (NLP) rules to identify the terms from the vocabularies in the abstracts which are marked as entities in the annotated output. The relationships are stored in a Graph structure which allows OntoSearch to deduce complex interactions.

OntoSearch is evaluated against the abstracts in BioCreative. The first prototype achieved an F-Score of 35%, where the average F-Score of the top performing methods is 45% and the range of the F-scores for the competing methods is between 32% and 57%.

We plan to improve the capability of OntoSearch by refining the NLP rules to capture genes and gene products better. We also plan to use Deep Learning approaches to improve the annotation capability.
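
For readers unfamiliar with the metric reported above, the following Python sketch shows how an F-score is computed from annotation counts. It assumes the balanced F1 variant, which the abstract does not specify, and the counts are hypothetical rather than OntoSearch's actual results.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 35 correct annotations, 65 spurious, 65 missed.
p, r = precision_recall(tp=35, fp=65, fn=65)
f1 = f_score(p, r)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays close to the lower of precision and recall, so improving whichever of the two is weaker tends to raise the score most.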

Unbiased and Comprehensive NLP-derived Marker Genes for scRNA-seq Cell Type Annotation

COSI: Text Mining

Anna Yannakopoulos, Michigan State University, United States
Stephanie Hickey, Michigan State University, United States
Arjun Krishnan, Michigan State University, United States

Short Abstract: Many cell type annotation methods use labeled reference cell atlases or manually-curated lists of marker genes to infer which cell type a new cluster represents, but these methods can introduce study and selection bias based on what the reference atlases or marker gene lists include. In addition, these methods are often incapable of annotating cells to cell types not found in the reference atlas or the list of marker genes. Here, we propose an approach that uses natural language processing of millions of PubMed abstracts to associate potential marker genes with all of the thousands of cell types described in the Uberon Ontology automatically and without bias. First, we create numerical representations of genes and cell types by embedding them in a shared high-dimensional space based on the text of over 17 million biomedical abstracts in PubMed and the curated hierarchical relationships between cell types in the Uberon Ontology. We then train a deep neural network to associate gene embeddings with ontology-referenced cell types. Our cross-validation results show that our method can extend known marker gene lists to encompass novel gene/cell type relationships, even when the gene and/or cell type has not been previously studied in this context.
